We have a list of header names (e.g. "REFERENCENUMBER", "document_id", "buyer", ...) and want to determine which of them belong to a given entity (e.g. "buyer" or "award"). In general, an entity is composed of multiple header names, and multiple entities can be contained in a single set of header names.
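As a small illustration of this many-to-many relationship (the header and entity names below are invented, not taken from the real sample data):

```python
# Made-up example: a single header set containing headers from two entities.
header_set = ['REFERENCENUMBER', 'BUYER_NAME', 'BUYER_ADDRESS', 'AWARD_DATE']

# Hypothetical ground-truth assignment of headers to entities.
entity_of = {
    'BUYER_NAME': 'buyer',      # the 'buyer' entity is composed of
    'BUYER_ADDRESS': 'buyer',   # multiple header names ...
    'AWARD_DATE': 'award',      # ... and the same set also contains 'award'
}
entities_present = sorted(set(entity_of.values()))
print(entities_present)  # both entities occur in this single header set
```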
The classification algorithm should accomplish two things:
In [158]:
from words import split_words
In [185]:
import samples
reload(samples)  # pick up any changes made to the samples module
#load_samples_by_entity returns a dict mapping each entity type to all
#header name combinations observed for that entity.
samples = samples.load_samples_by_entity(["Keywords", "UK", "Georgia", "Canada", "Mexico", "EU"], cache=True)
print "Entity types:", ", ".join(samples.keys())
In [186]:
#Example: header name combinations for 'buyer' entity
print dict(samples)['buyer']
For the classification, we first use a DictVectorizer to generate a sparse feature matrix from the header name occurrences. Then, we use a linear support vector classifier to classify the headers.
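To make the vectorization step concrete, here is a minimal, self-contained sketch (with invented header names, written against the current scikit-learn API rather than the notebook's older one) of how DictVectorizer turns occurrence dicts into a sparse matrix:

```python
from sklearn.feature_extraction import DictVectorizer

# Two invented occurrence dicts, one per header set.
occurrences = [
    {'REFERENCENUMBER': 1, 'buyer': 1},
    {'document_id': 1, 'buyer': 1},
]
vec = DictVectorizer()
X = vec.fit_transform(occurrences)  # sparse matrix: one column per header name
print(X.shape)       # (2, 3): two samples, three distinct header names
print(X.toarray())   # rows are samples, columns are 0/1 header indicators
```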
In [188]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import text,DictVectorizer
pipe = Pipeline([
    ('hv', DictVectorizer()),
    ('svm', LinearSVC()),
])
In [190]:
#We generate the training data set.
#For each entity type and sample, we build a dict recording which header
#names occur, paired with the entity type as the label.
counts = []
entities = []
for entity, headers_list in samples.items():
    for headers in headers_list:
        counts.append(dict((header, 1) for header in headers))
        entities.append(entity)
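Run on a toy samples dict (made-up entity and header names, standing in for the real loaded data), the loop above yields one occurrence dict per header combination, with a parallel list of labels:

```python
# Toy stand-in for the dict returned by samples.load_samples_by_entity.
toy_samples = {
    'buyer': [['BUYER_NAME', 'BUYER_ID'], ['buyer']],
    'award': [['AWARD_DATE', 'AWARD_VALUE']],
}
counts, entities = [], []
for entity, headers_list in toy_samples.items():
    for headers in headers_list:
        counts.append({header: 1 for header in headers})
        entities.append(entity)
print(len(counts), len(entities))  # 3 occurrence dicts, 3 matching labels
```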
In [269]:
#We fit the counts to the entities
pipe.fit(counts,entities)
#We predict the entity type of a given header name, alongside one of the
#training labels for comparison
pipe.predict([{'NOTICETYPE': 1}]), entities[30]
Out[269]:
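For a self-contained version of this fit/predict round trip (again with invented training data, and modern print syntax), the pipeline can be exercised like this:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('hv', DictVectorizer()), ('svm', LinearSVC())])

# Invented training data: each dict marks which header names occur.
counts = [{'BUYER_NAME': 1}, {'buyer': 1}, {'AWARD_DATE': 1}, {'AWARD_VALUE': 1}]
entities = ['buyer', 'buyer', 'award', 'award']

pipe.fit(counts, entities)
print(pipe.predict([{'AWARD_DATE': 1}]))  # predict for a header seen in training
```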
In [274]:
from sklearn import cross_validation
rs = cross_validation.ShuffleSplit(len(counts), n_iter=2, train_size=0.75, test_size=0.25)
for train_index, test_index in rs:
    print 'train', train_index, '\ntest', test_index
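Note that the sklearn.cross_validation module has since been removed; in current scikit-learn the same splitter lives in sklearn.model_selection, takes n_splits instead of n_iter, and yields indices via .split(X). A sketch of the equivalent:

```python
from sklearn.model_selection import ShuffleSplit

# Modern equivalent of cross_validation.ShuffleSplit(len(counts), n_iter=2, ...).
rs = ShuffleSplit(n_splits=2, train_size=0.75, test_size=0.25, random_state=0)
X = list(range(8))  # stand-in for the counts list
for train_index, test_index in rs.split(X):
    print('train', train_index, '\ntest', test_index)
```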
In [275]:
#Performance of this model is poor because the sample size is very small.
for train_index, test_index in rs:
    pipe.fit([counts[i] for i in train_index], [entities[i] for i in train_index])
    print pipe.score([counts[i] for i in test_index], [entities[i] for i in test_index])
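The manual fit/score loop can also be written with cross_val_score. A self-contained sketch on invented, perfectly separable toy data (so the scores here are trivially perfect, unlike the real small-sample case):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('hv', DictVectorizer()), ('svm', LinearSVC())])

# Invented data: two classes with disjoint header names.
counts = [{'BUYER_NAME': 1}] * 4 + [{'AWARD_DATE': 1}] * 4
entities = ['buyer'] * 4 + ['award'] * 4

cv = ShuffleSplit(n_splits=2, train_size=0.75, test_size=0.25, random_state=0)
scores = cross_val_score(pipe, counts, entities, cv=cv)
print(scores)  # perfect scores on this trivially separable toy data
```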